
BMC Bioinformatics

Springer Science and Business Media LLC

All preprints, ranked by how well they match BMC Bioinformatics's content profile, based on 383 papers previously published here. The average preprint has a 0.37% match score for this journal, so anything above that is already an above-average fit. Older preprints may already have been published elsewhere.

1
ClusTrast: a short read de novo transcript isoform assembler guided by clustered contigs

Westrin, K. J.; Kretzschmar, W. W.; Emanuelsson, O.

2022-01-03 bioinformatics 10.1101/2022.01.02.473666 medRxiv
Top 0.1%
66.5%

Background: Transcriptome assembly from RNA-sequencing data in species without a reliable reference genome has to be performed de novo, but studies have shown that de novo methods often have inadequate ability to reconstruct transcript isoforms. We address this issue by constructing an assembly pipeline whose main purpose is to produce a comprehensive set of transcript isoforms. Results: We present the de novo transcript isoform assembler ClusTrast, which takes short read RNA-seq data as input, constructs a primary assembly, clusters a set of guiding contigs, aligns the short reads to the guiding contigs, assembles each clustered set of short reads individually, and merges the primary and clusterwise assemblies into the final assembly. We tested ClusTrast on real datasets from six eukaryotic species, and showed that ClusTrast reconstructed more expressed known isoforms than any of the other tested de novo assemblers, at a moderate reduction in precision. For recall, ClusTrast was on top in the lower end of expression levels (<15th percentile) for all tested datasets, and over the entire range for almost all datasets. Reference transcripts were often (35-69% for the six datasets) reconstructed to at least 95% of their length by ClusTrast, and more than half of reference transcripts (58-81%) were reconstructed with contigs that exhibited polymorphism, as measured on a subset of reliably predicted contigs. ClusTrast recall increased when using a union of assembled transcripts from more than one assembly tool as primary assembly. Conclusion: We suggest that ClusTrast can be a useful tool for studying isoforms in species without a reliable reference genome, in particular when the goal is to produce a comprehensive transcriptome set with polymorphic variants.

2
A single-cell clusters similarity measure for different batches, datasets, and samples

Gonzalez-Velasco, O.; Sanchez-Luis, E.; De La Rosa, E.; Sanchez-Santos, J. M.; De Las Rivas, J.

2022-03-18 bioinformatics 10.1101/2022.03.14.483731 medRxiv
Top 0.1%
59.5%

Summary: Since the inception of single-cell level measuring techniques, identification of distinct cell stages, phenotypes and populations has been a challenge. Cell clustering and dimensionality reduction methods are the most popular approaches to identify heterogeneity of single-cell data. But, as public repositories continue to grow in number, integrative analyses and merging of large pools of samples from different and heterogeneous datasets become a difficult challenge, which showcases the lack of scalability of some of the existing methods. Here we present ClusterFoldSimilarity, an R package that calculates a measure of similarity between clusters from different datasets/batches, without the need to correct for batch effect or to normalize and merge the data, thus avoiding artifacts and the loss of information derived from these kinds of techniques. The similarity metric is based on the average vector module and sign of the product of logarithmic fold-changes. ClusterFoldSimilarity compares every single pair of clusters from any number of different samples/datasets, including different numbers of clusters for each sample. Additionally, the algorithm is able to select the top genes which contribute the most to the similarity of two specific clusters, serving also as a feature selection tool. Availability and implementation: The algorithm is freely available as an R package at: https://github.com/OscarGVelasco/ClusterFoldSimilarity Contact: oscargvelasco@gmail.com

3
BLASE: Bulk Linkage Analysis for Single Cell Experiments - Teasing Out the Secrets of Bulk Transcriptomics with Trajectory Analysis

McCluskey, A.; Kettlewell, T.; Smith, A. M.; Kundu, R.; Gunn, D. A.; Otto, T. D.

2025-09-07 bioinformatics 10.1101/2025.09.03.673925 medRxiv
Top 0.1%
54.7%

Motivation: scRNA-seq experiments can capture cell process trajectories. Bulk RNA-seq is more practical, but does not have the granularity to elucidate cell-type-specific trajectories. Deconvolution methods can estimate cell types in RNA-seq data, but there is a need for methods characterising their pseudotime. Results: We show that our method, BLASE, can identify the progress of an RNA-seq sample through a trajectory in a scRNA-seq reference. Conclusion: BLASE can be used to a) annotate scRNA-seq data from existing RNA-seq, b) identify progress of RNA-seq data through a process based on scRNA-seq data, and c) correct developmental differences in RNA-seq differential expression analysis.

4
ExtendAlign: the post-analysis tool to correct and improve the alignment of dissimilar short sequences

Flores-Torres, M.; Gomez-Romero, L.; Hasse-Hernandez, J. I.; Aguilar-Ordonez, I.; Tovar, H.; Avendano-Vazquez, S. E.; Flores-Jasso, C. F.

2019-08-16 bioinformatics 10.1101/475707 medRxiv
Top 0.1%
53.6%

In this work, we evaluated several tools used for the alignment of short sequences and found that most aligners perform reasonably well for identical sequences, whereas a variety of alignment errors emerge for dissimilar ones. Since alignments are essential in computational biology, we developed ExtendAlign, a post-analysis tool that corrects these errors and improves the alignment of dissimilar short sequences. We used simulated and biological data to show that ExtendAlign outperforms the other aligners in most metrics tested. ExtendAlign is useful for pinpointing the identity percentage for alignments of short sequences in the range of ~35-50% similarity.

5
dna-parser: a Python library written in Rust for fast encoding of DNA and RNA sequences

Vilain, M.; Aris-Brosou, S.

2026-01-21 bioinformatics 10.64898/2026.01.20.700656 medRxiv
Top 0.1%
53.4%

Background: The ever-growing amount of available biological data leads modern analysis to be performed on large datasets. Unfortunately, bioinformatics tools for preprocessing and analyzing data are not always designed to treat such large amounts of data efficiently. Notably, this is the case when encoding DNA and RNA sequences into numerical representations, also called descriptors, before passing them to machine learning models. Furthermore, current Python tools available for this preprocessing step are not well suited to be integrated into pipelines, resulting in slow encoding speeds. Results: We introduce dna-parser, a Python library written in Rust to encode DNA and RNA sequences into numerical features. The combination of Rust and Python allows sequences to be encoded rapidly and in parallel across multiple threads while maintaining compatibility with packages from the Python ecosystem. Moreover, this library implements many of the most widely used types of numerical feature schemes coming from bioinformatics and natural language processing. Conclusion: dna-parser is an easy to install Python library that offers many Python wheels for Linux (musllinux and manylinux), macOS, and Windows via pip (https://pypi.org/project/dna-parser/). The open source code is available on GitHub (https://github.com/Mvila035/dna_parser) along with the documentation (https://mvila035.github.io/dna_parser/documentation/).
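To illustrate what encoding a sequence into a numerical descriptor means, here is a minimal pure-Python sketch of one-hot encoding. It does not use dna-parser and makes no claim about that library's API: the function name `one_hot_encode`, the column order `ACGT`, and the choice to map any other symbol to an all-zero row are assumptions made for this example.

```python
# Hypothetical helper, NOT part of dna-parser: one-hot encode a DNA/RNA string.
ALPHABET = "ACGT"  # assumed column order for this sketch


def one_hot_encode(sequence: str) -> list[list[int]]:
    """Return one row per base, with a 1 in the column of that base.

    RNA is handled by treating U as T. Any other symbol (for example N)
    becomes an all-zero row, which is an assumption of this sketch.
    """
    index = {base: i for i, base in enumerate(ALPHABET)}
    rows = []
    for base in sequence.upper().replace("U", "T"):
        row = [0] * len(ALPHABET)
        if base in index:
            row[index[base]] = 1
        rows.append(row)
    return rows


if __name__ == "__main__":
    for row in one_hot_encode("ACGU"):
        print(row)
```

A plain loop like this processes one sequence at a time in the interpreter, which is the kind of bottleneck the abstract says a compiled, multi-threaded implementation is meant to avoid.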

6
When do longer reads matter? A benchmark of long read de novo assembly tools for eukaryotic genomes

Cosma, B. M.; Shirali Hossein Zade, R.; Jordan, E. N.; van Lent, P.; Peng, C.; Pillay, S.; Abeel, T.

2023-02-02 bioinformatics 10.1101/2023.01.30.526229 medRxiv
Top 0.1%
53.2%

Background: Assembly algorithm choice should be a deliberate, well-justified decision when researchers create genome assemblies for eukaryotic organisms from third-generation sequencing technologies. While third-generation sequencing by Oxford Nanopore Technologies (ONT) and Pacific Biosciences (PacBio) has overcome the disadvantages of short read lengths specific to next-generation sequencing (NGS), third-generation sequencers are known to produce more error-prone reads, thereby generating a new set of challenges for assembly algorithms and pipelines. Since the introduction of third-generation sequencing technologies, many tools have been developed that aim to take advantage of the longer reads, and researchers need to choose the correct assembler for their projects. Results: We benchmarked state-of-the-art long-read de novo assemblers to help readers make a balanced choice for the assembly of eukaryotes. To this end, we used 13 real and 72 simulated datasets from different eukaryotic genomes, with different read length distributions, imitating PacBio CLR, PacBio HiFi, and ONT sequencing to evaluate the assemblers. We include five commonly used long read assemblers in our benchmark: Canu, Flye, Miniasm, Raven and Redbean. Evaluation categories address the following metrics: reference-based metrics, assembly statistics, misassembly count, BUSCO completeness, runtime, and RAM usage. Additionally, we investigated the effect of increased read length on the quality of the assemblies, and report that read length can, but does not always, positively impact assembly quality. Conclusions: Our benchmark concludes that there is no assembler that performs the best in all the evaluation categories. However, our results show that overall Flye is the best-performing assembler, both on real and simulated data. Next, the benchmarking using longer reads shows that the increased read length improves assembly quality, but the extent to which that can be achieved depends on the size and complexity of the reference genome.

7
FindBacksplice: a Tool for Locating Circular RNA Backsplice Coordinates

Kraljevic, M.; Ozturk, A.; Jank, M.; LeDuc, R. D.; Keijzer, R.

2025-09-13 bioinformatics 10.1101/2025.09.08.674962 medRxiv
Top 0.1%
52.1%

Circular RNAs (circRNAs) are generated through back-splicing, a process where a backsplice junction (BSJ) is formed, based on the circRNA's unique sequence. BSJs are highly conserved and can be mapped to chromosomal coordinates. Current platforms determine these coordinates from bulk RNA sequencing data. We aimed to develop a tool capable of determining backsplice coordinates based on BSJ sequences for any species and genome version. Motivation: circRNAs are emerging as important regulators of cellular differentiation and other biologically important processes. Common tools for circRNA analyses require a circular RNA's backsplice coordinates. These can be accessed from public databases. However, the coordinates are specific to a version of a species' genome, and are unavailable for many model organisms. Results: We have developed a Python-based, command line tool, FindBacksplice, which produces backsplice coordinates for any available genome, based on a circRNA's BSJ sequence. Implemented in Python, this script is integrated with BLAST for use in existing pipelines. We were able to find valid locations of backsplices for known human BSJs in the rat genome and produce backsplice coordinates for use in existing pipelines. Availability and implementation: FindBacksplice is available at github.com/m-kraljevic/findbacksplice

8
Tailored Graphical Lasso for Data Integration in Gene Network Reconstruction

Lingjaerde, C.; Lien, T. G.; Borgan, O.; Glad, I. K.

2020-12-30 bioinformatics 10.1101/2020.12.29.424744 medRxiv
Top 0.1%
51.2%

Background: Identifying gene interactions is a topic of great importance in genomics, and approaches based on network models provide a powerful tool for studying these. Assuming a Gaussian graphical model, a gene association network may be estimated from multiomic data based on the non-zero entries of the inverse covariance matrix. Inferring such biological networks is challenging because of the high dimensionality of the problem, making traditional estimators unsuitable. The graphical lasso is constructed for the estimation of sparse inverse covariance matrices in Gaussian graphical models in such situations, using L1-penalization on the matrix entries. An extension of the graphical lasso is the weighted graphical lasso, in which prior biological information from other (data) sources is integrated into the model through the weights. There are however issues with this approach, as it naively forces the prior information into the network estimation, even if it is misleading or does not agree with the data at hand. Further, if an associated network based on other data is used as the prior, weighted graphical lasso often fails to utilize the information effectively. Results: We propose a novel graphical lasso approach, the tailored graphical lasso, that aims to handle prior information of unknown accuracy more effectively. We provide an R package implementing the method, tailoredGlasso. Applying the method to both simulated and real multiomic data sets, we find that it outperforms the unweighted and weighted graphical lasso in terms of all performance measures we consider. In fact, the graphical lasso and weighted graphical lasso can be considered special cases of the tailored graphical lasso, and a parameter determined by the data measures the usefulness of the prior information. With our method, mRNA data are demonstrated to provide highly useful prior information for protein-protein interaction networks. Conclusions: The method we introduce utilizes useful prior information more effectively without involving any risk of loss of accuracy should the prior information be misleading.
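For orientation, the two estimators that the abstract names as baselines can be written in their standard textbook form. This is a sketch of the usual formulation, not a transcription from the paper: the symbols $S$ (sample covariance matrix), $\Theta$ (inverse covariance matrix), $\lambda$ (penalty parameter) and $P$ (matrix of non-negative prior weights) are notation assumed here, and the tailored graphical lasso's own transformation of the prior weights is deliberately not reproduced.

```latex
% Graphical lasso: L1-penalized Gaussian log-likelihood.
% Here \lVert \cdot \rVert_1 is the elementwise sum of absolute values;
% some formulations penalize only the off-diagonal entries.
\hat{\Theta}_{\text{glasso}}
  = \arg\max_{\Theta \succ 0}
    \left\{ \log\det\Theta - \operatorname{tr}(S\Theta)
            - \lambda \lVert \Theta \rVert_1 \right\}

% Weighted graphical lasso: the penalty on entry (i, j) is scaled by P_{ij},
% so prior information enters through P (\circ is the elementwise product).
\hat{\Theta}_{\text{wglasso}}
  = \arg\max_{\Theta \succ 0}
    \left\{ \log\det\Theta - \operatorname{tr}(S\Theta)
            - \lambda \lVert P \circ \Theta \rVert_1 \right\}
```

Setting every $P_{ij} = 1$ recovers the unweighted graphical lasso, and an edge between genes $i$ and $j$ is inferred wherever the off-diagonal estimate $\hat{\Theta}_{ij}$ is non-zero.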

9
Bias in miRNA enrichment analysis related to gene functional annotations

Zagganas, K.; Georgakilas, G. K.; Vergoulis, T.; Dalamagas, T.

2021-08-17 bioinformatics 10.1101/2021.08.16.456527 medRxiv
Top 0.1%
51.2%

Background: miRNA functional enrichment is a type of analysis that is used to predict which biological functions may be affected by a group of miRNAs or validate whether a list of dysregulated miRNAs are linked to a diseased state. The standard method for functional enrichment analysis uses the hypergeometric distribution to produce p-values, depicting the strength of the association between a group of miRNAs and a biological function. However, in 2015, it was shown that this approach suffers from a bias related to miRNA targets produced by target prediction algorithms and a new randomization test was proposed to alleviate this issue. Results: We demonstrate the existence of another previously unreported underlying bias which affects gene annotation data sets; additionally, we show that the statistical measure used for the established randomization test is not sensitive enough to account for it. In this context, we show that the use of the Jaccard coefficient (an alternative statistical measure) is able to alleviate the aforementioned issue. Conclusions: In this paper, we illustrate the existence of a new bias affecting the miRNA functional enrichment analysis. This bias makes Fisher's exact test unsuitable for miRNA functional enrichment analyses and there is also a need to adjust the established unbiased test accordingly. We propose the use of a modified version of the established test and, in order to facilitate its use, we introduce a novel unbiased miRNA enrichment analysis tool that implements the proposed method. At the same time, by leveraging bit vectors, our tool guarantees fast and scalable execution. Availability: All datasets used in the experiments throughout this paper are openly accessible on Zenodo (https://doi.org/10.5281/zenodo.5175819).
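The two statistical measures named in the abstract can be sketched in a few lines of standard-library Python. This is only an illustration of the standard hypergeometric enrichment p-value and of the Jaccard coefficient; it is not the authors' tool and does not implement their modified randomization test. The function names and the example gene identifiers are hypothetical.

```python
from math import comb


def hypergeom_pvalue(population: int, annotated: int, drawn: int, overlap: int) -> float:
    """Upper-tail hypergeometric p-value P(X >= overlap).

    population: number of genes in the background
    annotated:  background genes carrying the functional annotation
    drawn:      size of the target gene list being tested
    overlap:    target genes that carry the annotation
    """
    upper = min(annotated, drawn)
    total = comb(population, drawn)
    # math.comb returns 0 when k > n, so impossible terms vanish on their own.
    tail = sum(
        comb(annotated, k) * comb(population - annotated, drawn - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total


def jaccard(a: set, b: set) -> float:
    """Jaccard coefficient |A & B| / |A | B|; defined as 0.0 for two empty sets here."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


if __name__ == "__main__":
    targets = {"GENE1", "GENE2", "GENE3"}      # hypothetical miRNA target genes
    annotation = {"GENE2", "GENE3", "GENE4"}   # hypothetical annotated gene set
    print(jaccard(targets, annotation))
    print(hypergeom_pvalue(population=10, annotated=5, drawn=5, overlap=5))
```

The hypergeometric p-value depends on the background size and on how many genes carry the annotation, whereas the Jaccard coefficient depends only on the two sets being compared.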

10
PLO(SC)2: Plots and Scripts for scRNA-seq analysis

Joppich, M.

2025-03-14 bioinformatics 10.1101/2025.03.09.642205 medRxiv
Top 0.1%
51.0%

Background: scRNA-seq analysis has become a standard technique for studying biological systems. As costs decrease, scRNA-seq experiments become increasingly complex. While typical scRNA-seq analysis frameworks provide basic functionality to analyze such data sets, downstream analysis and visualization become a bottleneck. Standard plots are not always suitable to provide specific insight into such complex data sets and should be extended to provide camera-ready, meaningful plots. Results: With PLO(SC)2, a collection of plotting and analysis scripts for use in Seurat-based scRNA-seq data analyses is presented, which are accessible for custom script-based analyses or within an R shiny app. The analysis scripts mainly provide a collection of code blocks which enable a comfortable basic analysis of scRNA-seq data, from Seurat object creation through filtering to data set integration, in fewer than 10 function calls. Subsequently, code blocks for performing differential and enrichment analyses and corresponding visualizations are provided. Finally, several enhanced visualizations are provided, such as the enhanced Heatmap, DotPlot and comparative Box-/Violin plots. These, particularly, allow the user to specify how the shown values should be scaled, allowing the accurate creation of condition-wise plots. Conclusion: With the PLO(SC)2 framework, data analysis of scRNA-seq experiments becomes more comfortable and streamlined, while visualizations are enhanced to be suitable for interpreting complex datasets. The PLO(SC)2 scripts are available from GitHub and include a vignette showing how PLO(SC)2 is applied within a script-based analysis, as well as an R shiny app.

11
barbieQ: An R software package for analysing barcode count data from clonal tracking experiments

Fei, L.; Maksimovic, J.; Oshlack, A.

2025-09-14 bioinformatics 10.1101/2025.09.11.675529 medRxiv
Top 0.1%
50.9%

Motivation: A "clone" encompasses a progenitor cell and its progeny cells. Tracking clonal composition as cells differentiate or evolve is useful in many fields. Various single-cell lineage tracing (clonal tracking) technologies use unique DNA barcodes that are passed from progenitor cells to their offspring. The barcode count for each sample indicates cell number in clones. However, analysis of barcode count data is often bespoke and relies on visualisations and heuristics. A generalized workflow for preprocessing and robust statistical analysis of barcode count data across protocols is needed. Results: We introduce barbieQ, a Bioconductor R package for analysing barcode count data across groups of samples. It provides data-driven quality control and filtering, extensive visualisations, and two statistical tests: 1) Differential barcode proportion (differences in proportions between sample groups), and 2) Differential barcode occurrence (differences in presence/absence odds between groups). Both tests handle complex experimental designs using regression models and rigorously account for sample-to-sample variability. We validated both tests on semi-simulated, real data and a case study, demonstrating that they hold their size, are sufficiently powered to detect true differences, and outperform existing approaches.

12
Evaluation and Aggregation of Active Module Identification Algorithms

Liu, J.; Xu, M.; Xing, J.

2025-10-07 bioinformatics 10.1101/2025.10.06.680790 medRxiv
Top 0.1%
49.3%

Background: High-throughput sequencing methods have generated vast amounts of genetic data for candidate gene studies. As a part of the analysis, candidate genes are often analyzed through Gene-Gene interaction (GGI) networks. These networks can become very large, necessitating efficient methods to reduce their complexity. Active Module Identification (AMI) is a common method to analyze GGI networks by identifying enriched subnetworks representing relevant biological processes. Multiple AMI algorithms have been developed for biological datasets, and a comprehensive assessment of these algorithms and a comparative analysis of their behaviors across a variety of use-cases are crucial for their appropriate applications. Results: In this study, we use the Empirical Pipeline (EMP) to evaluate four AMI algorithms - PAPER, DOMINO, FDRnet, and HotNet2 - on their ability to produce context-specific enrichment. When testing the algorithms on four biological datasets, our results reveal that no single algorithm outperforms the others across all datasets. Moreover, the output modules are often dissimilar, suggesting that different algorithms capture complementary biological signals. Our results suggest that a comprehensive analysis requires the aggregation of outputs from multiple algorithms. We propose two methods to this end: a spectral clustering approach for module aggregation, and an algorithm that combines modules with similar network structures called Greedy Conductance-based Merging (GCM). Conclusions: Overall, our results advance our understanding of AMI algorithms and how they should be applied. Tools and workflows developed in this study will facilitate researchers working with AMI algorithms to enhance their analyses. Our code is freely available at https://github.com/LiuJ0/AMI-Benchmark/.
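The abstract names conductance as the quantity behind its merging algorithm but does not define it. The sketch below computes the standard graph conductance of a node set in an undirected, unweighted network, which is presumably the notion intended; it does not implement GCM itself. The function name, the adjacency-dictionary representation, and the convention of returning 0.0 when one side of the cut has no edges are assumptions of this example.

```python
def conductance(adjacency: dict[str, set[str]], module: set[str]) -> float:
    """Conductance of `module`: cut(S, V \\ S) / min(vol(S), vol(V \\ S)).

    adjacency maps every node to the set of its neighbours (undirected graph,
    so each edge appears in both endpoints' sets). vol() is the sum of degrees.
    """
    # Edges leaving the module: one endpoint inside, the other outside.
    cut = sum(1 for u in module for v in adjacency[u] if v not in module)
    vol_in = sum(len(adjacency[u]) for u in module)
    vol_out = sum(len(nbrs) for u, nbrs in adjacency.items() if u not in module)
    denom = min(vol_in, vol_out)
    # Assumed convention: report 0.0 when one side carries no edges at all.
    return cut / denom if denom else 0.0


if __name__ == "__main__":
    # Hypothetical toy network: two triangles joined by the single edge C-D.
    graph = {
        "A": {"B", "C"},
        "B": {"A", "C"},
        "C": {"A", "B", "D"},
        "D": {"C", "E", "F"},
        "E": {"D", "F"},
        "F": {"D", "E"},
    }
    print(conductance(graph, {"A", "B", "C"}))
```

Low conductance means a node set is densely connected internally and only weakly connected to the rest of the network, which is why it is a natural score for judging whether a group of nodes forms a coherent module.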

13
s-aligner: a greedy algorithm for non-greedy de novo genome assembly

Bermudez, J.

2021-02-02 bioinformatics 10.1101/2021.02.02.429443 medRxiv
Top 0.1%
47.1%

Genome assembly is a fundamental tool for biological research. Particularly in microbiology, where budgets per sample are often scarce, it can make the difference between an inconclusive result and a fully valid conclusion. Identifying new strains or estimating the relative abundance of quasi-species in a sample are example tasks that cannot be properly accomplished without first generating assemblies that have little structural ambiguity and cover most of the genome. In this work, we present a new genome assembly tool based on a greedy strategy. We compare the results obtained applying this tool to the results obtained with previously existing software. We find that, when applied to viral studies, the software we developed often yields far larger contigs and higher genome fraction coverage than previous software. We also find a significant advantage when applied to exceptionally large virus genomes.

14
Accelign: a GPU-based Library for Accelerating Pairwise Sequence Alignment

Kallenborn, F.; Dabbaghie, F.; Steinegger, M.; Schmidt, B.

2025-12-19 bioinformatics 10.64898/2025.12.17.694868 medRxiv
Top 0.1%
45.5%

Background: The continually increasing volume of sequence data results in a growing demand for fast implementations of core algorithms. Computation of pairwise alignments based on dynamic programming is an important part of many bioinformatics pipelines and a major contributor to overall runtime due to the associated quadratic time complexity. This motivates the need for a library of efficient implementations on modern GPUs for a variety of alignment algorithms for different types of sequence data including DNA, RNA, and proteins. Results: Accelign is a library of accelerated pairwise sequence alignment algorithms for CUDA-enabled GPUs. Its parallelization strategy is based on a common wavefront design that can be adapted to support a variety of dynamic programming algorithms: local, global, and semi-global alignment of genomic and protein sequences with a variety of commonly used scoring schemes supporting one-to-one, one-to-many or all-to-all pairwise sequence alignments. This leads to a peak performance of 16.1 TCUPS and 9.1 TCUPS for computing optimal global alignment scores with linear and affine gap penalties, respectively, on a single RTX PRO 6000 Blackwell GPU. In addition, our library demonstrates significant speedups in several real-world case studies over prior CPU-based (SeqAn, Parasail, BSalign) and GPU-based libraries (ADEPT, GASAL2), and can even outperform highly customized algorithms (WFA-GPU, CUDASW++4.0). Furthermore, the performance of our approach scales linearly with the number of employed GPUs, which makes it feasible to exploit multi-GPU nodes for increased processing speeds. Conclusion: Accelign provides significant speedups for commonly used pairwise alignment algorithms compared to prior implementations. It is freely available at https://github.com/fkallen/Accelign.
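As a reference point for the quadratic-time dynamic programming mentioned above, the following is a minimal CPU sketch of the optimal global alignment score with a linear gap penalty (the Needleman-Wunsch recurrence). It is not Accelign code and uses none of its API; the function names and the scoring values are assumptions chosen for illustration. The helper `tcups` converts a cell count and a runtime into tera cell updates per second, the unit in which the abstract reports performance.

```python
def global_alignment_score(a: str, b: str, match: int = 1,
                           mismatch: int = -1, gap: int = -2) -> int:
    """Optimal global alignment score with a linear gap penalty.

    Fills the (len(a)+1) x (len(b)+1) DP matrix row by row, keeping only the
    previous row in memory. Each cell depends on its left, upper and
    upper-left neighbours.
    """
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def tcups(cell_updates: float, seconds: float) -> float:
    """Throughput in tera (1e12) cell updates per second."""
    return cell_updates / seconds / 1e12


if __name__ == "__main__":
    x, y = "GATTACA", "GATTACA"
    print(global_alignment_score(x, y))
    # One pairwise alignment updates len(x) * len(y) DP cells.
    print(len(x) * len(y))
```

Because each cell depends only on its left, upper and upper-left neighbours, all cells on the same anti-diagonal are independent of one another, and that independence is what a wavefront parallelization exploits.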

15
Workstation benchmark of Spark Capable Genome Analysis ToolKit 4 Variant Calling

Hansen, M. H.; Simonsen, A. T.; Ommen, H. B.; Nyvold, C. G.

2020-05-19 bioinformatics 10.1101/2020.05.17.101105 medRxiv
Top 0.1%
44.5%

Background: Rapid and practical DNA-sequencing processing has become essential for modern biomedical laboratories, especially in the fields of cancer, pathology and genetics. While sequencing turn-over time has been, and still is, a bottleneck in research and diagnostics, the field of bioinformatics is moving at a rapid pace, both in terms of hardware and software development. Here, we benchmarked the local performance of three of the most important Spark-enabled Genome analysis toolkit 4 (GATK4) tools in a targeted sequencing workflow: duplicate marking, base quality score recalibration (BQSR) and variant calling on targeted DNA sequencing using a modest hyperthreading 12-core single CPU and a high-speed PCI express solid-state drive. Results: Compared to the previous GATK version, the performance of Spark-enabled BQSR and HaplotypeCaller is shifted towards a more efficient usage of the available CPU cores and outperforms the earlier GATK3.8 version with an order of magnitude reduction in processing time to analysis-ready variants, whereas MarkDuplicateSpark was found to be thrice as fast. Furthermore, HaploTypeCallerSpark and BQSRPipelineSpark were significantly faster than the equivalent GATK4 standard tools with a combined ~86% reduction in execution time, reaching a median rate of ten million processed bases per second, and duplicate marking time was reduced by ~42%. The called variants were found to be in close agreement between the Spark and non-Spark versions, with an overall concordance of 98%. In this setup, the tools were also highly efficient when compared with execution on a small 72 virtual CPU/18-node Google Cloud cluster. Conclusion: GATK4 offers practical parallelization possibilities for DNA sequence processing, and the Spark-enabled tools optimize performance and utilization of local CPUs. Spark-enabled GATK variant calling is several times faster than previous GATK3.8 multithreading with the same multi-core, single-CPU configuration. The improved opportunities for parallel computations not only hold implications for high-performance clusters, but also for modest laboratory or research workstations for targeted sequencing analysis, such as exome, panel or amplicon sequencing.

16
Diviner uncovers hundreds of novel human (and other) exons through comparative analysis of proteins

Wheeler, T.; Nord, A. J.

2024-05-05 bioinformatics 10.1101/2024.05.05.592595 medRxiv
Top 0.1%
44.2%

Background: Eukaryotic genes are often composed of multiple exons that are stitched together by splicing out the intervening introns. These exons may be conditionally joined in different combinations to produce a collection of related, but distinct, mRNA transcripts. For protein-coding genes, these products of alternative splicing lead to production of related protein variants (isoforms) of a gene. Complete labeling of the protein-coding content of a eukaryotic genome requires discovery of mRNA encoding all isoforms, but it is impractical to enumerate all possible combinations of tissue, developmental stage, and environmental context; as a result, many true exons go unlabeled in genome annotations. Results: One way to address the combinatoric challenge of finding all isoforms in a single organism A is to leverage sequencing efforts for other organisms - each time a new organism is sequenced, it may be under a new combination of conditions, so that a previously unobserved isoform may be sequenced. We present Diviner, a software tool that identifies previously undocumented exons in organisms by comparing isoforms across species. We demonstrate Diviner's utility by locating hundreds of novel exons in the genomes of human, mouse, and rat, as well as in the ferret genome. Further, we provide analyses supporting the notion that most of the new exons reported by Diviner are likely to be part of a true (but unobserved) isoform of the containing species.

17
Block aligner: fast and flexible pairwise sequence alignment with SIMD-accelerated adaptive blocks

Liu, D.; Steinegger, M.

2021-11-08 bioinformatics 10.1101/2021.11.08.467651 medRxiv
Top 0.1%
43.2%

Background: The Smith-Waterman-Gotoh alignment algorithm is the most popular method for comparing biological sequences. Recently, Single Instruction Multiple Data methods have been used to speed up alignment. However, these algorithms have limitations: they are optimized for specific scoring schemes, cannot handle large gaps, or require quadratic time computation. Results: We propose a new algorithm called block aligner for aligning nucleotide and protein sequences. It greedily shifts and grows a block of computed scores to span large gaps within the aligned sequences. This greedy approach is able to compute only a fraction of the DP matrix. In exchange for these features, there is no guarantee that the computed scores are accurate compared to full DP. However, in our experiments, we show that block aligner performs accurately on various realistic datasets, and it is up to 9 times faster than the popular Farrar's algorithm for protein global alignments. Conclusions: Our algorithm has applications in computing global alignments and X-drop alignments on proteins and long reads. It is available as a Rust library at https://github.com/Daniel-Liu-c0deb0t/block-aligner.
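For readers unfamiliar with the baseline named at the start of the abstract, here is a minimal sketch of the full-DP Smith-Waterman-Gotoh local alignment score with affine gap penalties. It computes every cell of the DP matrix, which is exactly the quadratic work that block aligner avoids, and it is not block aligner's adaptive-block algorithm. The function name, the scoring values, and the gap-cost convention (a gap of length k costs `gap_open + (k - 1) * gap_extend`) are assumptions of this example.

```python
def local_affine_score(a: str, b: str, match: int = 2, mismatch: int = -1,
                       gap_open: int = 3, gap_extend: int = 1) -> int:
    """Best local alignment score under affine gap costs (full quadratic DP).

    H: best score of an alignment ending at cell (i, j)
    E: best score ending with a gap that consumes characters of b
    F: best score ending with a gap that consumes characters of a
    Only the previous row of H and F is kept in memory.
    """
    neg_inf = float("-inf")
    h_prev = [0] * (len(b) + 1)
    f_prev = [neg_inf] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        h_curr = [0]
        f_curr = [neg_inf]
        e = neg_inf  # E for the current row, carried from left to right
        for j in range(1, len(b) + 1):
            # Either extend an existing gap or open a new one.
            e = max(e - gap_extend, h_curr[j - 1] - gap_open)
            f = max(f_prev[j] - gap_extend, h_prev[j] - gap_open)
            sub = match if a[i - 1] == b[j - 1] else mismatch
            # The 0 lets a local alignment restart anywhere.
            h = max(0, h_prev[j - 1] + sub, e, f)
            h_curr.append(h)
            f_curr.append(f)
            best = max(best, h)
        h_prev, f_prev = h_curr, f_curr
    return best


if __name__ == "__main__":
    print(local_affine_score("ACGTGGCA", "ACGTTTGGCA"))
```

With affine costs, one long gap is cheaper than several short ones of the same total length, which is why handling large gaps efficiently matters for the sequences the abstract targets.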

18
R.ROSETTA: an interpretable machine learning framework

Garbulowski, M.; Diamanti, K.; Smolinska, K.; Baltzer, N.; Stoll, P.; Bornelov, S.; Ohrn, A.; Feuk, L.; Komorowski, J.

2020-06-10 bioinformatics 10.1101/625905 medRxiv
Top 0.1%
43.1%

Motivation: For machine learning to matter beyond intellectual curiosity, the models developed therefrom must be adopted within the greater scientific community. In this study, we developed an interpretable machine learning framework that allows identification of semantics from various datatypes. Our package can analyze and illuminate co-predictive mechanisms reflecting biological processes. Results: We present R.ROSETTA, an R package for building and analyzing interpretable machine learning models. R.ROSETTA gathers combinatorial statistics via rule-based modelling for accessible and transparent results, well-suited for adoption within the greater scientific community. The package also provides statistics and visualization tools that facilitate minimization of analysis bias and noise. Investigating case-control studies of autism, we showed that our tool provided hypotheses for potential interdependencies among features that discerned phenotype classes. These interdependencies regarded neurodevelopmental and autism-related genes. Although our sample application of R.ROSETTA was used for transcriptomic data analysis, R.ROSETTA works perfectly with any decision-related omics data. Availability: The R.ROSETTA package is freely available at https://github.com/komorowskilab/R.ROSETTA. Contact: mateusz.garbulowski@icm.uu.se (Mateusz Garbulowski), jan.komorowski@icm.uu.se (Jan Komorowski)

19
Choosing representative proteins based on splicing structure similarity improves the accuracy of gene tree reconstruction

Kuitche Kamela, E.; Degen, M.; Wang, S.; Ouangraoua, A.

2020-04-10 bioinformatics 10.1101/2020.04.09.034785 medRxiv
Top 0.1%
42.4%

Constructing accurate gene trees is important, as gene trees play a key role in several biological studies, such as species tree reconstruction, gene functional analysis and gene family evolution studies. The accuracy of these studies depends on the accuracy of the input gene trees. Although several methods have been developed for improving the construction and correction of gene trees by making use of the relationship with a species tree in addition to multiple sequence alignment, there is still considerable room for improvement in gene tree accuracy and computing time. In particular, accounting for alternative splicing, which allows eukaryote genes to produce multiple transcripts/proteins per gene, is a way to improve the quality of the multiple sequence alignments used by gene tree reconstruction methods. Current methods for gene tree reconstruction usually use a set of transcripts composed of one representative transcript per gene to generate multiple sequence alignments, which are then used to estimate gene trees. Thus, the accuracy of the estimated gene tree depends on the choice of the representative transcripts. In this work, we present an alternative-splicing-aware method called the Splicing Homology Transcript (SHT) method, which estimates gene trees by carefully selecting a set of homologous transcripts to represent the genes of a gene family. We introduce a new similarity measure for quantifying the level of homology between transcripts by combining a splicing structure-based similarity score with a sequence-based similarity score. We present a new method to cluster transcripts into a set of splicing homology groups based on the new similarity measure. The method is applied to reconstruct gene trees of the Ensembl database gene families, and a comparison with current EnsemblCompara gene trees is performed. The results show that the new approach improves gene tree accuracy thanks to the use of the new similarity measure between transcripts. An implementation of the method as well as the data used and generated in this work are available at https://github.com/UdeS-CoBIUS/SplicingHomologGeneTree/.

20
Meta-Align: A Novel HMM-based Algorithm for Pairwise Alignment of Error-Prone Sequencing Reads

TOMII, K.; Kumar, S.; Zhi, D.; Brenner, S. E.

2020-05-12 bioinformatics 10.1101/2020.05.11.087676 medRxiv
Top 0.1%
42.2%

Background: Insertion and deletion sequencing errors are relatively common in next-generation sequencing data and produce long stretches of mistranslated sequence. These frameshifting errors can seriously damage downstream analysis of reads. However, it is possible to obtain more precise alignment of DNA sequences by taking into account both the coding frame and sequencing errors estimated from quality scores. Results: Here we designed a novel hidden Markov model (HMM)-based pairwise alignment algorithm, Meta-Align, that aligns DNA sequences in the protein space, incorporating quality scores from the DNA sequences and allowing frameshifts caused by insertions and deletions. Our model is based on both an HMM transducer of a pair HMM and profile HMMs for all possible amino acid pairs. A Viterbi algorithm over our model produces the optimal alignment of a pair of metagenomic reads, taking into account all possible translating frames and gap penalties in both the protein space and the DNA space. To reduce the sheer number of states of this model, we also derived and implemented a computationally feasible model, leveraging the degeneracy of the genetic code. In a benchmark test on a diverse set of simulated reads based on BAliBASE, we show that Meta-Align outperforms TBLASTX, which compares the six-frame translations of a nucleotide query sequence against the six-frame translations of a nucleotide sequence database using the BLAST algorithm. We also demonstrate the effects of incorporating quality scores on Meta-Align. Conclusions: Meta-Align will be particularly effective when applied to error-prone DNA sequences. The package of our software can be downloaded at https://github.com/shravan-repos/Metaalign.
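
The six-frame translation mentioned in the TBLASTX comparison can be illustrated with a short Python sketch. It is not taken from Meta-Align or BLAST; the helper names (`translate`, `reverse_complement`, `six_frame_translations`) are assumptions for this example, and the standard genetic code is assumed. Codons containing characters other than A, C, G, or T are rendered as `X`.

```python
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order.
_BASES = "TCAG"
_AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(_BASES, repeat=3), _AMINO)
}

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of an upper-case DNA string."""
    return dna.translate(_COMPLEMENT)[::-1]


def translate(dna, frame=0):
    """Translate one reading frame (offset 0, 1, or 2); '*' marks a stop codon."""
    protein = []
    for i in range(frame, len(dna) - 2, 3):
        protein.append(CODON_TABLE.get(dna[i:i + 3], "X"))
    return "".join(protein)


def six_frame_translations(dna):
    """Return the three forward and three reverse-strand translations."""
    dna = dna.upper()
    rc = reverse_complement(dna)
    return [translate(dna, f) for f in range(3)] + [translate(rc, f) for f in range(3)]


if __name__ == "__main__":
    for label, protein in zip(["+1", "+2", "+3", "-1", "-2", "-3"],
                              six_frame_translations("ATGGCCTAA")):
        print(label, protein)
```

Each frame is translated independently, so a single inserted or deleted base moves the true coding sequence from one frame to another partway through the read. That is the frameshift problem the abstract says Meta-Align handles within a single alignment model.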